Abstract

De Bruijn graphs are widely used in bioinformatics for processing
next-generation sequencing (NGS) data. Because NGS datasets are very
large, it is essential to represent de Bruijn graphs compactly, and
several approaches to this problem have been proposed recently. In
this note, we show how to reduce the memory required by the algorithm
of [2], which represents de Bruijn graphs using Bloom filters. Our
method requires 25% to 42% less memory than the method of [2], with
only an insignificant increase in pre-processing and query times.



1 Introduction 

Modern Next-Generation Sequencing (NGS) technologies generate huge
volumes of short nucleotide sequences (reads) drawn from the DNA sample
under study. The length of a read varies from 35 to about 400 base pairs
(letters), and the number of reads may be in the hundreds of millions,
so the total volume of data may reach tens or even hundreds of Gb.

Many computational tools dealing with NGS data, especially those
devoted to genome assembly, are based on the concept of the de Bruijn
graph, see e.g. [5]. The nodes of the de Bruijn graph are all distinct
k-mers occurring in the reads, and two k-mers are linked by an edge if
they occur at successive positions in a read.^1 In practice, the value
of k is between 20 and 50.

The idea of using de Bruijn graphs for genome assembly goes back to the
"pre-NGS era" [6]. Note, however, that de novo assembly is not the only
application of those graphs [I] . 

Because NGS datasets are very large, it is essential to represent de
Bruijn graphs compactly. Recently, several papers have been published that
propose different approaches to compressing de Bruijn graphs [3], [7], [2], [1].

In this note, we focus on the method based on Bloom filters proposed in
[2]. Bloom filters provide a very space-efficient representation of a
subset of a given set (in our case, a subset of k-mers), at the price of
allowing one-sided errors, namely false positives. The method of [2] is
based on the following idea: if the only queried nodes (k-mers) are those
reachable from some node known to belong to the graph, then only a
fraction of all false positives can actually occur. Storing these false
positives explicitly leads to an exact (false-positive-free) and
space-efficient representation of the de Bruijn graph.

In this note, we show how to improve this scheme by improving the
representation of the set of false positives. We achieve this by
iteratively applying a Bloom filter to represent the set of false
positives, then the set of "false false positives", etc. We show
analytically that this cascade of Bloom filters allows for a considerable
further economy of memory, improving the method of [2]. Depending on the
value of k, our method requires 25% to 42% less memory than the method of
[2]. Moreover, with our method, the memory grows very little as k grows.
Finally, the pre-processing and query times increase only insignificantly
compared to the original method of [2].

2 Cascading Bloom filter 

Let $T_0$ be the set of k-mers of the de Bruijn graph that we want to
store. The method of [2] stores $T_0$ via a bitmap $B_1$ using a Bloom
filter, together with the set $T_1$ of critical false positives. $T_1$
consists of those k-mers which are reachable from $T_0$ by a graph edge
and which are stored in $B_1$ "by mistake", i.e. which belong to $B_1$
but are not in $T_0$. $B_1$ and $T_1$ are sufficient to represent the
graph provided that the only queried k-mers are those which are adjacent
to k-mers of $T_0$.

1 Note that this is actually a subgraph of the de Bruijn graph under its
classical combinatorial definition. However, we still call it a de Bruijn
graph, following the terminology common in the bioinformatics literature.

The idea we introduce in this note is to use this structure recursively
and represent the set $T_1$ by a new bitmap $B_2$ and a new set $T_2$,
then represent $T_2$ by $B_3$ and $T_3$, and so on. More formally,
starting from $B_1$ and $T_1$ defined as above, we define a series of
bitmaps $B_1, B_2, \dots$ and a series of sets $T_1, T_2, \dots$ as
follows. $B_2$ will store the set $T_1$ of critical false positives using
a Bloom filter, and the set $T_2$ will contain "true nodes" from $T_0$
that are stored in $B_2$ "by mistake" (we call them false$^2$ positives).
$B_3$ and $T_3$, and, generally, $B_i$ and $T_i$ are defined similarly:
$B_i$ stores the k-mers of $T_{i-1}$ using a Bloom filter, and $T_i$
contains the k-mers stored in $B_i$ "by mistake", i.e. those k-mers that
do not belong to $T_{i-1}$ but belong to $T_{i-2}$ (we call them
false$^i$ positives). Observe that $T_0 \cap T_1 = \emptyset$,
$T_0 \supseteq T_2 \supseteq T_4 \supseteq \dots$ and
$T_1 \supseteq T_3 \supseteq T_5 \supseteq \dots$.

The following lemma shows that the construction is correct, that is, it
allows one to verify whether or not a given k-mer belongs to the set $T_0$.

Lemma 1 Given an element (k-mer) $K$, consider the smallest $i$ such that
$K \notin B_{i+1}$ (if $K \notin B_1$, we define $i = 0$). Then, if $i$
is odd, then $K \in T_0$; and if $i$ is even (including zero), then
$K \notin T_0$.

Proof: Observe that $K \notin B_{i+1}$ implies $K \notin T_i$ by the
basic property of Bloom filters. We first check the Lemma for $i = 0, 1$.

For $i = 0$, we have $K \notin B_1$, and then $K \notin T_0$.

For $i = 1$, we have $K \in B_1$ but $K \notin B_2$. The latter implies
that $K \notin T_1$, and then $K$ must be a false$^2$ positive, that is
$K \in T_0$. Note that here we use the fact that the only queried k-mers
$K$ are either nodes of $T_0$ or their neighbors in the graph (see [2]),
and therefore if $K \in B_1$ and $K \notin T_0$ then $K \in T_1$.

For the general case $i \geq 2$, we show by induction that
$K \in T_{i-1}$. Indeed, $K \in B_1 \cap \dots \cap B_i$ implies
$K \in T_{i-1} \cup T_i$ (which, again, is easily seen by induction), and
$K \notin B_{i+1}$ implies $K \notin T_i$.

Since $T_{i-1} \subseteq T_0$ for odd $i$, and $T_{i-1} \subseteq T_1$
for even $i$ (recall that $T_0 \cap T_1 = \emptyset$), the lemma
follows. □

Naturally, the lemma provides an algorithm to check if a given k-mer $K$
belongs to the graph: it suffices to check successively if it belongs to
$B_1, B_2, \dots$ until we encounter the first $B_{i+1}$ which does not
contain $K$. Then the answer simply depends on whether $i$ is even or odd.

2 By a slight abuse of language, we say that "an element belongs to
$B_i$" if it is accepted by the corresponding Bloom filter.

In our reasoning so far, we assumed an infinite number of bitmaps $B_i$.
Of course, in practice we cannot store infinitely many (or even just
many) bitmaps. Therefore we "truncate" the construction at some step $t$
and store a finite set of bitmaps $B_1, B_2, \dots, B_t$ together with an
explicit representation of $T_t$. The procedure of Lemma 1 is extended in
the obvious way: if $K \in B_i$ for all $1 \leq i \leq t$, then the
answer is determined by directly checking whether $K \in T_t$.
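
The truncated construction and the query procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [2]: the `BloomFilter` class, its parameters (8 bits per element, 4 hash functions, SHA-256 based hashing), the helper names, and the toy reads are all assumptions made for the example. Neighbors are taken to be the 8 one-letter extensions of a k-mer, and reverse complements are ignored.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter over strings (not tuned for real data)."""

    def __init__(self, n_items, bits_per_item=8, n_hashes=4, salt=""):
        self.size = max(8, n_items * bits_per_item)
        self.n_hashes = n_hashes
        self.salt = salt
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        for j in range(self.n_hashes):
            digest = hashlib.sha256(f"{self.salt}:{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all((self.bits[p >> 3] >> (p & 7)) & 1 for p in self._positions(item))


def kmers(reads, k):
    """All distinct k-mers occurring in the reads (the set T_0)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def neighbors(kmer):
    """The 8 possible one-letter extensions of a k-mer (assumed query model)."""
    return {kmer[1:] + c for c in "ACGT"} | {c + kmer[:-1] for c in "ACGT"}


def build_cascade(t0, t):
    """Return ([B_1, ..., B_t], T_t) for the cascade truncated at step t."""
    outside = {n for kmer in t0 for n in neighbors(kmer)} - t0
    older, newer = outside, set(t0)  # play the roles of T_{i-2} and T_{i-1}
    filters = []
    for i in range(1, t + 1):
        b = BloomFilter(len(newer), salt=str(i))  # B_i stores T_{i-1}
        for kmer in newer:
            b.add(kmer)
        filters.append(b)
        # T_i: elements of T_{i-2} that B_i accepts "by mistake"
        older, newer = newer, {kmer for kmer in older if kmer in b}
    return filters, newer


def contains(kmer, filters, t_last):
    """Membership in T_0, valid only for k-mers of T_0 and their neighbors."""
    for i, b in enumerate(filters):  # filters[i] is B_{i+1}
        if kmer not in b:
            return i % 2 == 1  # Lemma 1: odd i means K is in T_0
    # Accepted by all t filters: decide with the explicit set T_t, which is a
    # subset of T_0 when t is even and of T_1 when t is odd.
    in_last = kmer in t_last
    return in_last if len(filters) % 2 == 0 else not in_last


reads = ["ACGTACGTGACCTGA", "GTGACCTGATTACAG", "TTACAGGCATCGTAC"]  # toy data
t0 = kmers(reads, 5)
filters, t_last = build_cascade(t0, t=4)
queries = t0 | {n for kmer in t0 for n in neighbors(kmer)}
print(all(contains(q, filters, t_last) == (q in t0) for q in queries))
```

By Lemma 1 and its truncated extension, the final check holds whatever the hash functions do, as long as only k-mers of $T_0$ and their neighbors are queried. Other k-mers may be answered incorrectly.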

3 Memory and time usage 

First, we estimate the memory needed by our data structure, under the
assumption of an infinite number of bitmaps. Let $N$ be the number of
"true positives", i.e. nodes of $T_0$. As shown in [2], if $N$ is the
number of nodes we want to store through a bitmap $B_1$ of size $rN$,
then the expected number of critical false positive nodes (set $T_1$)
will be $8Nc^r$, where $c = 0.6185$. Then, to store these $8Nc^r$
critical false positive nodes, we use a bitmap $B_2$ of size $8rNc^r$.
Bitmap $B_3$ is used for storing nodes of $T_0$ which are stored in $B_2$
"by mistake" (set $T_2$). We estimate the number of these nodes as the
fraction $c^r$ (false positive rate of filter $B_2$) of $N$ (size of
$T_0$), that is $Nc^r$. Similarly, the number of nodes we need to put
into $B_4$ is $8Nc^r$ multiplied by $c^r$, i.e. $8Nc^{2r}$. Continuing in
this way, we obtain that the memory needed for the whole structure is
$rN + 8rNc^r + rNc^r + 8rNc^{2r} + rNc^{2r} + \dots$ bits. Dividing this
expression by $N$ to obtain the number of bits per k-mer, we get

$$r + 8rc^r + rc^r + 8rc^{2r} + \dots = r(1 + c^r + c^{2r} + \dots) + 8rc^r(1 + c^r + c^{2r} + \dots) = (1 + 8c^r)\,\frac{r}{1 - c^r}.$$

A simple calculation shows that the minimum of this expression is
achieved when $r = 6.299$, and then the minimum memory used per k-mer
is 9.18801 bits.
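
The "simple calculation" can be checked numerically. The sketch below evaluates the closed-form expression $(1 + 8c^r)\,r/(1 - c^r)$ on a grid of values of $r$ and keeps the smallest. The function name, the search interval [1, 20], and the step 0.001 are arbitrary choices made for this illustration.

```python
C = 0.6185  # constant c quoted in the text


def bits_per_kmer_infinite(r, c=C):
    """(1 + 8 c^r) * r / (1 - c^r): bits per k-mer for an infinite cascade."""
    x = c ** r
    return (1 + 8 * x) * r / (1 - x)


# Plain grid search over r in [1, 20] with step 0.001.
grid = [i / 1000 for i in range(1000, 20001)]
best_r = min(grid, key=bits_per_kmer_infinite)
best_bits = bits_per_kmer_infinite(best_r)
print(f"r = {best_r:.3f}, bits per k-mer = {best_bits:.5f}")
```

The minimum is quite flat, so nearby values of $r$ give almost the same number of bits per k-mer.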

As mentioned earlier, in practice we store only a finite number of
bitmaps $B_1, \dots, B_t$ together with an explicit representation (hash
table) of $T_t$. In this case, the memory taken by the bitmaps is a
truncated sum $rN + 8rNc^r + rNc^r + \dots$, and a hash table storing
$T_t$ takes either $2k \cdot Nc^{\lceil t/2 \rceil r}$ or
$2k \cdot 8Nc^{\lceil t/2 \rceil r}$ bits, depending on whether $t$ is
even or odd. This follows from the observation that we need to store
$Nc^{\lceil t/2 \rceil r}$ (or $8Nc^{\lceil t/2 \rceil r}$) k-mers, and
every k-mer takes $2k$ bits of memory. Consequently, we have to adjust
the optimal value of $r$ minimizing the total space, and re-estimate the
resulting space spent on one k-mer.

    k      optimal r     bits per k-mer
    16     6.447053      9.237855
    32     6.609087      9.298095
    64     6.848718      9.397210
    128    7.171139      9.548099

Table 1: Optimal value of $r$ (bitmap size of a Bloom filter per number
of stored elements) and the resulting space per k-mer for $t = 4$.
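
The values of Table 1 can be recomputed from the truncated sum plus the cost of the explicit set $T_t$. The following sketch assumes the sizes derived in the text ($|T_{i}|$ equal to $N c^{(i/2) r}$ for even $i$ and $8N c^{((i+1)/2) r}$ for odd $i$) and again uses a simple grid search. The function names and the grid are illustrative choices.

```python
C = 0.6185  # constant c quoted in the text


def bits_per_kmer_truncated(r, k, t, c=C):
    """Bits per k-mer for bitmaps B_1..B_t plus an explicit table for T_t."""
    x = c ** r
    bitmaps = 0.0
    for i in range(1, t + 1):
        # |T_{i-1}| / N is x^((i-1)/2) for odd i and 8 * x^(i/2) for even i
        stored = x ** ((i - 1) // 2) if i % 2 == 1 else 8 * x ** (i // 2)
        bitmaps += r * stored
    # |T_t| / N, each element stored explicitly in 2k bits
    last = x ** (t // 2) if t % 2 == 0 else 8 * x ** ((t + 1) // 2)
    return bitmaps + 2 * k * last


def optimal_r(k, t):
    """Grid search for the best r in [1, 20] with step 0.001."""
    grid = [i / 1000 for i in range(1000, 20001)]
    return min(grid, key=lambda r: bits_per_kmer_truncated(r, k, t))


for k in (16, 32, 64, 128):
    r = optimal_r(k, t=4)
    print(k, round(r, 3), round(bits_per_kmer_truncated(r, k, 4), 6))
```

For $t = 4$ the expression reduces to $r(1 + 9c^r + 8c^{2r}) + 2k\,c^{2r}$, so the dependence on $k$ enters only through the small term $2k\,c^{2r}$.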

Table 1 shows the optimal $r$ and the space per k-mer for $t = 4$ and
several values of $k$. It demonstrates that even this small $t$ leads to
considerable memory savings. The space per k-mer is very close to the
"optimal" space (9.18801 bits) obtained for an infinite number of
filters. Table 1 also reveals another advantage of our improvement: the
number of bits per stored k-mer remains almost constant for different
values of $k$.

We now compare the memory usage of our method with that of the original
method of [2]. The data structure of [2] has been estimated to take
$1.44 \log_2(16k/2.08) + 2.08$ bits per k-mer. Using this formula,
Table 2 compares the space consumption per k-mer of the different
methods. Observe that with the method of [2], doubling the value of $k$
increases the memory by 1.44 bits per k-mer, whereas in our method the
increase is very small, as mentioned earlier.



    k      "Optimal" (infinite)      Cascading Bloom filter    Data structure
           cascading Bloom filter    with t = 4                from [2]
    16     9.18801                   9.237855                  12.0785
    32     9.18801                   9.298095                  13.5185
    64     9.18801                   9.397210                  14.9585
    128    9.18801                   9.548099                  16.3985

Table 2: Comparison of space consumption per k-mer for the "optimal"
(infinite) cascading Bloom filter, the finite ($t = 4$) cascading Bloom
filter, and the Bloom filter from [2], for different values of $k$.
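
The last column of Table 2 can be recomputed with a few lines of Python, taking the estimate for [2] to be $1.44 \log_2(16k/2.08) + 2.08$ bits per k-mer, a reading that reproduces the tabulated values. The function name and the dictionary below are illustrative; the $t = 4$ values are copied from Table 1.

```python
import math


def bits_per_kmer_ref2(k):
    """1.44 * log2(16k / 2.08) + 2.08, the estimate quoted for [2]."""
    return 1.44 * math.log2(16 * k / 2.08) + 2.08


# Bits per k-mer of the cascade with t = 4, copied from Table 1.
cascade_t4 = {16: 9.237855, 32: 9.298095, 64: 9.397210, 128: 9.548099}

for k, ours in cascade_t4.items():
    ref = bits_per_kmer_ref2(k)
    print(f"k = {k:3d}: [2] = {ref:.4f} bits, cascade = {ours:.4f} bits, "
          f"difference = {ref - ours:.4f} bits")
```

Because $k$ appears only inside the logarithm, each doubling of $k$ adds exactly 1.44 bits to the estimate for [2].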






Let us now estimate the preprocessing and query times for our scheme. If
the value of $t$ is small (such as $t = 4$, as in Table 1), the
preprocessing time grows insignificantly in comparison to the original
method of [2]. To construct each $B_i$, we need to store $T_{i-2}$
(possibly on disk, if we want to save on the internal memory used by the
algorithm) in order to compute those k-mers which are stored in $B_{i-1}$
"by mistake". Since the size of $B_i$ decreases exponentially with $i$,
the time spent constructing the whole structure is linear in the size of
$T_0$.

In the case of small $t$, the query time grows insignificantly as well,
even though a query may have to go through up to $t$ Bloom filters
instead of just one. Because the sizes of the $B_i$ decrease
exponentially, the average query time remains almost the same.